Abstract

The extremely high error rates reported by Keegan et al. in 'A platform-independent method for detecting errors in
metagenomic sequencing data: DRISEE' (PLoS Comput Biol 2012;8:e1002541) for many next-generation sequencing
datasets prompted us to re-examine their results. Our analysis reveals that the presence of conserved artificial
sequences, e.g. Illumina adapters, and other naturally occurring sequence motifs accounts for most of the reported
errors. We conclude that DRISEE reports inflated levels of sequencing error, particularly for Illumina data. Tools
offered for evaluating large datasets need scrupulous review before they are implemented.
INTRODUCTION

Error identification and correction in high-throughput
sequencing datasets, especially at the single read
level, have been addressed by many investigators
[1-18]. Many approaches use platform-dependent
quality scores, read consensus or k-mer analysis.
Recently, Keegan et al. [19] described DRISEE, a
method to assess quality of genomic and metagenomic
next-generation sequencing runs. The authors
analysed numerous publicly available datasets with
DRISEE and reported widely variable levels of
sequencing errors, generally far higher than other
published estimates [1, 20-22].

DRISEE bases its error estimates on variation from 
a consensus sequence in bins of artificially duplicated 
reads (ADRs). DRISEE assumes that prior to 
sequencing, over-amplification from a given start 
point in the template leads to formation of ADRs, 
and that sequencing error, not naturally occurring 
sequence diversity, accounts for sequence variation 
within an ADR bin. An ADR bin consists of all reads 
starting with an identical prefix, by default the first 
50 nt of the read. 
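
The binning step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the DRISEE implementation: the function names `bin_by_prefix` and `per_position_error`, the majority-vote consensus and the `min_reads` cut-off are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict


def bin_by_prefix(reads, prefix_len=50, min_reads=20):
    """Group reads by identical prefix; keep bins with more than min_reads reads."""
    bins = defaultdict(list)
    for read in reads:
        # Skip reads with ambiguous bases or with nothing after the prefix.
        if "N" in read or len(read) <= prefix_len:
            continue
        bins[read[:prefix_len]].append(read)
    return {prefix: members for prefix, members in bins.items()
            if len(members) > min_reads}


def per_position_error(members, prefix_len=50):
    """Fraction of reads differing from the majority base at each post-prefix position."""
    shortest = min(len(read) for read in members)
    errors = []
    for pos in range(prefix_len, shortest):
        counts = Counter(read[pos] for read in members)
        majority = counts.most_common(1)[0][1]
        errors.append(1 - majority / len(members))
    return errors


# Toy data: a 4-nt prefix instead of 50 nt, and a bin cut-off of 2 reads.
toy_reads = ["ACGTAA", "ACGTAA", "ACGTAA", "ACGTAC", "TTTTGG"]
toy_bins = bin_by_prefix(toy_reads, prefix_len=4, min_reads=2)
for prefix, members in toy_bins.items():
    print(prefix, len(members), per_position_error(members, prefix_len=4))
```

The sketch makes the central assumption visible: every mismatch to the bin consensus is counted as error, so any bin whose members share a prefix for reasons other than artificial duplication will report biological variation as sequencing error.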



DRISEE as described might provide a better
method for estimating sequencing errors than the
platform-based quality scores; however, the authors
failed to carefully examine the origins of ADR bins.
DRISEE analyses all reads except those that contain
ambiguous bases. The authors correctly note, 'Bins
can be screened for eukaryotic content, sequences
with low complexity, and/or known sequences that
may exhibit an unusually high level of biological
repetition (16s rRNA-based, sequences with low
complexity, eukaryotic sequences etc.). Bins that contain
such sequences should be excluded from further
consideration'. However, the Supplemental Methods
in the DRISEE manuscript reveal that the authors did
not exclude such reads.

Widespread Illumina adapter contamination

We obtained from the NCBI Sequence Read 
Archive (SRA) the 12 metagenomic datasets that 
were used in the original publication to generate 
Figure 4b. DRISEE error estimation demonstrated 
[Figure 1 plot. Title: Contribution of Illumina Adapters to DRISEE Error Estimation. x-axis: Datasets (SRA run accessions). Legend: Illumina adapter-contaminated reads; Illumina adapter-free reads.]

Figure 1: Change in DRISEE error estimation for reads with and without Illumina adapter contamination for all
12 datasets that were used in the original publication to demonstrate how DRISEE error profiles differ markedly
from quality scores.



a significant discrepancy from the quality scores
reported by the Illumina platform. Our analysis of
DRISEE-generated ADR bins with >20 reads
showed that Illumina adapter sequences drive the
formation of these bins. This 65-nt adapter sequence
usually occurs upstream of the sequencing primer-binding
site. Unfortunately, Illumina adapter artifacts
sometimes contaminate libraries. Unless they are filtered
out or trimmed, reads starting with Illumina
adapters will present identical 65-nt prefixes at the
start of the read and create a spurious ADR bin.
DRISEE interprets the actual biological variation
that follows the adapter sequences in these bins as
extensive sequencing error.

With DRISEE (version 1.2), we re-analysed the
12 datasets, identifying reads as 'adapter contaminated'
if they presented at least 15 nt of perfect identity
to the Illumina adapter sequences in the first 50 nt
(see Supplemental Methods). Figure 1 shows the
marked difference in error estimation for reads
with and without Illumina adapters. Although the
claim by Keegan et al. [19] that the true error rates
are higher than those reported in the quality scores may
be correct, the exceedingly high error rates presented
in Figure 4b of the original publication reflect the
presence of untrimmed Illumina adapter sequences
and do not support their claims.
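
The screening criterion above (at least 15 nt of perfect identity to an adapter within the first 50 nt of a read) amounts to asking whether the head of the read shares any 15-mer with an adapter. The Python below is a rough sketch of that test, not the procedure from the Supplemental Methods. `TOY_ADAPTER` is a made-up sequence rather than a real Illumina adapter, the function names are hypothetical, and the sketch assumes the match must lie entirely inside the 50-nt window.

```python
def kmer_set(sequence, k):
    """All k-length substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_adapter_index(adapters, min_match=15):
    """Union of all min_match-mers found in the adapter sequences."""
    index = set()
    for adapter in adapters:
        index |= kmer_set(adapter, min_match)
    return index


def is_adapter_contaminated(read, adapter_index, min_match=15, window=50):
    """True if the first `window` nt share a perfect min_match-mer with an adapter."""
    head = read[:window]
    return any(head[i:i + min_match] in adapter_index
               for i in range(len(head) - min_match + 1))


# Hypothetical adapter used only for illustration (NOT a real Illumina sequence).
TOY_ADAPTER = "ACGTTGCAAGGCTTAACCGGTTAAGC"
index = build_adapter_index([TOY_ADAPTER])

contaminated = "TTT" + TOY_ADAPTER[5:22] + "G" * 40   # 17-nt match near the start
clean = "A" * 60                                      # no adapter content
late_match = "T" * 40 + TOY_ADAPTER[:15] + "T" * 10   # match straddles position 50

for name, read in [("contaminated", contaminated), ("clean", clean),
                   ("late_match", late_match)]:
    print(name, is_adapter_contaminated(read, index))
```

Splitting reads with this predicate before binning gives the two groups compared in Figure 1; in practice the real adapter sequences for the library kit would replace the toy one.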

Spurious ADR bins caused by adapter sequences
differ markedly from valid bins in the magnitude of 
errors and their distribution by nucleotide position. 
Figure 2 shows DRISEE output from individual 
large bins from dataset SRR061459. The adapter- 
generated bin exhibits error greater than zero at all 
positions following the prefix and the average error 
greatly exceeds that of the valid ADR bin. 

Low-complexity and conserved gene reads

Analysis of all 10 Illumina genomic datasets, as well
as 10 randomly chosen Illumina metagenomic runs
from Keegan et al.'s Figure 3 [19], detected significant
Illumina adapter contamination and a high proportion
of low-complexity reads in all datasets, both of
which generated spurious bins that inflated DRISEE
error drastically. Table 1 demonstrates the inflation
of DRISEE error for one of these datasets chosen
randomly (SRA accession SRR061488). Genes
with conserved regions followed by biological variation
that commonly occurs in both bacterial and
eukaryotic genomes can create bins large enough
[Figure 2 plot. Title: Variation in DRISEE Error Based on Bin Type. x-axis: Position in the Read. Legend: Illumina Adapter-Generated Bin; Valid ADR Bin.]

Figure 2: DRISEE error by position. The largest bin contained 15264 reads and the prefix appeared to be a true 
ADR (bacterial genomic sequence). The per cent error at each position is plotted on the y-axis (light blue). Scores 
for an adapter-generated bin with 8177 reads are shown for comparison (dark red). 



99 TTGTATTGTTATTAACTCTTTCTTCAAATCGTAGTCCTTAAGAACAGTAT 
68 ACTTATAGTGATTACATAACACAATCATCTTCTATCAGTTGTATGACTAC 
64 TTATAGTGATTACATAACACAATCATCTTCTATCAGTTGTATGACTACAA 
62 ATAGAGACTTATAGTGATTACATAACACAATCATCTTCTATCAGTTGTAT 
58 TTGTGTTATGTAATCACTATAAGTCTCTATCGTAGAGTATAGAAGATTGA 
57 GTGATTACATAACACAATCATCTTCTATCAGTTGTATGACTACAACTGAT 
56 CGATAGAGACTTATAGTGATTACATAACACAATCATCTTCTATCAGTTGT 
54 GTGTTATGTAATCACTATAAGTCTCTATCGTAGAGTATAGAAGATTGAGT 
53 ATAGTGATTACATAACACAATCATCTTCTATCAGTTGTATGACTACAACT 
51 TATAGTGATTACATAACACAATCATCTTCTATCAGTTGTATGACTACAAC 
48 TACGATAGAGACTTATAGTGATTACATAACACAATCATCTTCTATCAGTT 
48 ATCTTCTATCAGTTGTATGACTACAACTGATTTCTTTTGGATACCCAAAA 
46 GATAGAGACTTATAGTGATTACATAACACAATCATCTTCTATCAGTTGTA 
46 ATTGTATTGTTATTAACTCTTTCTTCAAATCGTAGTCCTTAAGAACAGTA 
46 AGACTTATAGTGATTACATAACACAATCATCTTCTATCAGTTGTATGACT 
45 TTGTGATGTTATTAACTCTTTCTTCAAATCGTAGTCCTTAAGAACAGTAT 
44 ATTGTGTTATGTAATCACTATAAGTCTCTATCGTAGAGTATAGAAGATTG 
44 AGAGACTTATAGTGATTACATAACACAATCATCTTCTATCAGTTGTATGA 
42 TGATTACATAACACAATCATCTTCTATCAGTTGTATGACTACAACTGATT 
42 CTTATAGTGATTACATAACACAATCATCTTCTATCAGTTGTATGACTACA 



Figure 3: Some of the motifs that generated invalid bins for dataset 4441625.3. The first 20 largest bins are shown. 
The first column is bin size and the second is the 50-nt prefix. Similar motifs are shown using the same font colour 
Table 1: Change in DRISEE error estimation for SRR061488 after removing adapter-contaminated and
low-complexity bins from the analysis

Category                     Number*   DRISEE error (%)
All bins                     4766      39.9
Adapter-contaminated bins    1645      45.3
Low-complexity bins          2718      34.8
Remaining bins               403       6.6

*Number of bins containing >20 reads and no ambiguous bases in their prefixes.


to be considered by DRISEE and inflate overall error 
estimations. For instance, 74 of 403 bins in 
SRR061488 derive from the 16S rRNA gene. 
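
Low-complexity prefixes of the kind counted in Table 1 can be flagged with a simple heuristic. The text does not state which criterion was used, so the Python below is only one crude possibility: it scores a prefix by the fraction of its 3-mers that are distinct. The function names, the choice of k = 3 and the 0.5 threshold are illustrative assumptions, not values taken from the analysis.

```python
def distinct_kmer_fraction(sequence, k=3):
    """Distinct k-mers divided by total k-mers; near 0 for repetitive sequence."""
    total = len(sequence) - k + 1
    if total <= 0:
        return 0.0
    return len({sequence[i:i + k] for i in range(total)}) / total


def is_low_complexity(prefix, k=3, threshold=0.5):
    """Flag a prefix whose distinct k-mer fraction falls below an assumed threshold."""
    return distinct_kmer_fraction(prefix, k) < threshold


# Toy prefixes: a homopolymer, a dinucleotide repeat and a mixed sequence.
for prefix in ["A" * 50, "AT" * 25, "ACGTTGCAAGGCTTAACCGGTTAAGC"]:
    print(prefix[:12], round(distinct_kmer_fraction(prefix), 3),
          is_low_complexity(prefix))
```

A screen like this addresses only repetitive prefixes. Bins seeded by conserved genes such as 16S rRNA have complex prefixes and would need a separate check, for example a comparison against a reference database.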

Platform-specific error 

Keegan et al. [19] also report a striking difference in
error rates between 454 and Illumina datasets. As we
have shown, contaminating adapter sequences account
for much of the DRISEE error in Illumina
datasets. We next analysed 55 of the 65 Roche/454
metagenomic datasets used to generate Figure
3 in the DRISEE manuscript (the other 10 datasets
were not available in MG-RAST or SRA).

Our analysis showed that while adapter contamination
is rare in 454 data, the 50-nt prefixes from 34
of the datasets were dominated by similar sequence
motifs from sources we could not identify
(see Supplemental Methods). Figure 3 exemplifies
some of these motifs in one dataset (MG-RAST
ID 4441625.3). Identical motifs in multiple datasets
from the same research project suggest a library preparation
artifact. Bins from another eight datasets had
low-complexity, repetitive sequence prefixes. Whole
genome amplification provided material for at least
six of these libraries. Other datasets derived from
metatranscriptomic material and contained a high
proportion of rRNA-templated reads. The majority
of the datasets used to compare the error rates of
sequencing platforms in Figure 3 from the original
publication violate underlying assumptions of
DRISEE and led to publication of misleading results.

Improving DRISEE 

Not all reads that share the same first 50 bases represent
artificial duplication. Meaningful results from
DRISEE require understanding the source and distribution
of sequence sets with identical prefixes.
Suspicious bins must be excluded. However, this
adds a layer of complexity and might result in too
few bins to reach a robust error estimate. The minimum
number of bins necessary to reach a reliable
estimate and the impact of the sub-sampling necessary
to complete the analysis in a reasonable time
were not adequately addressed by the authors.

Although DRISEE may eventually have the potential
to identify problematic datasets and assess the
sequencing quality of next-generation sequencing
runs based on ADRs, the current version of the software
is inadequate and its results are unrealistic.

SUPPLEMENTARY DATA 

Supplementary data are available online at
http://bib.oxfordjournals.org/.